Abstract

This paper investigates the use of redundancy and self-repair against node failures in distributed storage systems under
various strategies. With replication, access to a single replica node is sufficient to reconstruct a lost node, whereas in
MDS erasure-coded systems, which are optimal in terms of the redundancy-reliability tradeoff, a single node failure is
repaired only after recovering the entire stored data. Regenerating codes, in turn, yield a tradeoff curve between storage
capacity and repair bandwidth. This paper investigates a new storage code. Specifically, we propose a non-MDS $(2k, k)$ code
that tolerates any three node failures and, more importantly, we show that with this code a single node failure can be
repaired through access to only three nodes.

I. Introduction 

The field of large-scale data storage has witnessed significant growth in recent years, driven by applications such as social
networks and file sharing. In storage systems, data is stored over multiple nodes (independent storage devices such as disks,
servers or peers), and a storage node may fail or leave the system. Reliable storage over individually unreliable nodes can
therefore be achieved by introducing redundancy.

There are various strategies for distributing redundancy and, depending on the method used, the system can tolerate a limited
number of node failures. Moreover, to keep the redundancy at the level it had before any failure, the system should have a
self-repairing capability: each failed node is replaced with a new node after data is transferred over the network.
Reconstructing a failed node is called the repair problem, and the bandwidth consumed by this maintenance is called the repair
bandwidth.

Erasure codes are the most common strategy for distributing redundancy. An erasure-coded system employs a total of $n$ packets
of the same size, $k$ of which are data packets (the fragments of the original data file) and $n - k$ of which are parity
packets (the coding information). The coding can be done using MDS or non-MDS codes. In a distributed storage system, these
packets are stored at $n$ different nodes over the network. MDS codes [1] are optimally space-efficient, and their encoding is
such that access to any $k$ nodes is adequate to recover the original data file. In these codes, each parity node increases
fault tolerance; in other words, an $(n, k)$ MDS-coded system can tolerate any $n - k$ node failures.

Replication, RAID 5, RAID 6 [2], and Reed-Solomon codes [3] are the most popular MDS codes that have been used in storage
systems. In replication, the parity nodes and data nodes are the same: each data node has a replica that is stored in a
related parity node. RAID 5 and RAID 6 employ $n - k = 1$ and $n - k = 2$ parity nodes, respectively, whereas Reed-Solomon
codes can be designed for any value of $(n, k)$ [2]. Another class of MDS codes is that of MDS array codes, such as EVENODD
[4], extended EVENODD [5], Row-Diagonal Parity (RDP) [6], X-code [7], P-code [8], B-code [9], and STAR code [10]. These codes
are based on the XOR operation and have lower encoding and decoding complexity than Reed-Solomon codes.

In [11], Low-Density Parity-Check (LDPC) codes are introduced as a class of non-MDS codes. These codes aim at reducing the
computational cost of encoding and decoding over lossy networks; however, they are not as space-efficient as MDS codes.
Non-MDS codes are further investigated in several papers. As a case in point, Hafner [12] proposes a new class of non-MDS
XOR-based codes, called WEAVER codes. WEAVER codes are vertical codes that can tolerate up to 12 node failures. In a vertical
code, such as X-code and WEAVER codes, each node contains both data and parity packets. In contrast, each node in a flat XOR
code, such as EVENODD, holds either data or parity packets. The authors of [13] describe the construction of two novel flat
XOR-based codes, called stepped combination and HD-combination codes. Also in [13], chain codes, a variant of the chained
configuration method [14], are investigated.

Standard MDS codes are inefficient with respect to the repair problem: recreating a failed node consumes a repair bandwidth
equal to the entire stored data. This motivated Dimakis et al. [15] to propose a repair-optimal MDS code, called a
regenerating code, which establishes a tradeoff between repair bandwidth and storage capacity per node. It is shown in [15]
that any point on the identified tradeoff curve can be achieved through the use of network coding. Furthermore, in [16] an
extension of regenerating codes, dubbed generalized regenerating codes, is introduced for the case in which a different
download cost is associated with each node. Moreover, the authors of [17] investigate the case in which the newcomer node can
wisely select the existing nodes to connect to.
Fig. 1. The graphical representation of the proposed code. The original data file can be recovered by connecting to: (i) two
nodes from one partition and $k - 2$ different nodes selected over $k - 2$ different partitions (solid lines), or (ii)
$2m \le k$ parity nodes and $k - 2m$ systematic nodes selected from $k$ different partitions (dashed lines show the specific
case $m = 0$).



The repair model presented in [15] is functional repair, in which the recreated packets stored at the replacement node may
differ from the lost packets. This contrasts with exact repair, in which each lost packet is reconstructed exactly. Exact
repair for minimum bandwidth regenerating codes is investigated in [18]. In [19], [20], [21], exact repair for minimum
storage regenerating codes is addressed on the basis of interference alignment.

Regenerating codes outperform existing MDS erasure codes in terms of maintenance bandwidth; however, constructing a new packet
requires communication with $d \ge k$ nodes, and the minimum repair bandwidth is achieved when $d = n - 1$. In addition, the
surviving nodes have to apply random linear network coding to their packets. Accordingly, many of the proposed constructions
require a huge finite-field size and are not feasible for practical storage systems. The current study introduces an
$(n, k) = (2k, k)$ non-MDS XOR-based code that can tolerate any three node failures. Moreover, we show that with this code a
single node failure can be repaired through access to only three nodes, regardless of $k$.

The rest of the paper is organized as follows. Section II states the construction and motivates the main idea. In Section III
we explain the repair problem of the proposed code. Finally, Section IV concludes the paper.

II. Construction 

In this section we describe the construction of the proposed non-MDS code. Fig. 1 shows a graphical representation of this
code. The code belongs to the class of flat XOR codes; it contains $2k$ storage nodes, and each node stores one packet. Of the
$2k$ nodes, $k$ nodes, i.e., $\{S_i\}_{i=1,\dots,k}$, hold data fragments and are called systematic nodes. The remaining $k$
nodes, i.e., $\{P_i\}_{i=1,\dots,k}$, are the parity nodes, which store parity packets. Each systematic node $S_i$ has a
related parity node $P_i$, and the two form a partition. With this construction, the code thus comprises $k$ partitions.

To store a file of size $M$ using this construction, the file is divided into $k$ fragments, i.e., $d_1, d_2, d_3, \dots, d_k$,
each of size $M/k$. Each fragment can be a single bit or a block of bits. These fragments are stored at the $k$ systematic
nodes. Fig. 2 illustrates an $(n, k) = (10, 5)$ code corresponding to this construction. Referring to Fig. 2, the five data
fragments, i.e., $d_1$, $d_2$, $d_3$, $d_4$ and $d_5$, are stored at nodes $S_1$, $S_2$, $S_3$, $S_4$ and $S_5$, respectively.
The parity packet $p_i$ to be stored in parity node $P_i$ is computed as

$$p_i = \sum_{j=1,\, j \ne i}^{k} d_j, \tag{1}$$

for $i = 1, \dots, k$. The addition here is the bit-by-bit XOR of data packets. For instance, in a $(10, 5)$ code, as can be
seen in Fig. 2, the parity packets $p_1 = d_2 + d_3 + d_4 + d_5$, $p_2 = d_1 + d_3 + d_4 + d_5$, $p_3 = d_1 + d_2 + d_4 + d_5$,
$p_4 = d_1 + d_2 + d_3 + d_5$ and $p_5 = d_1 + d_2 + d_3 + d_4$ are stored in parity nodes $P_1$, $P_2$, $P_3$, $P_4$ and
$P_5$, respectively. For the specific case $k = 2$ this code behaves like replication. For $k = 3$ the parity packets coincide
with those of the chain code proposed in [13].

Fig. 2. The repair problem of an $(n, k) = (10, 5)$ code. The lost packet $d_1$ can be repaired using three packets, including
its related parity packet, i.e., $d_2 + d_3 + d_4 + d_5$. When $d_2 + d_3 + d_4 + d_5$ has also failed, $d_1$ can be
reconstructed using four nodes from other partitions.

We now discuss how the original file can be recovered. In response to a request to reconstruct the original data file, a Data
Collector (DC) is initiated and connects to existing nodes. With this construction, the DC needs to connect to at least $k$ of
the existing nodes. Recall that the proposed construction is non-MDS, so access to an arbitrary set of $k$ nodes out of the
existing $2k$ nodes does not ensure that the original file can be restored. A data collector has two possible strategies for
selecting $k$ storage nodes to connect to: (i) The DC can connect to both the systematic node and the parity node of one
partition and to $k - 2$ different nodes selected from $k - 2$ different partitions out of the $k - 1$ remaining partitions
(the solid lines in Fig. 1 are a specific case of this scenario). With this strategy there are

$$\binom{k}{1}\binom{k-1}{k-2}\, 2^{k-2} = k(k-1)\, 2^{k-2} \tag{2}$$

options for the DC to choose $k$ nodes to connect to. (ii) The DC can connect to $2m \le k$ parity nodes and $k - 2m$
systematic nodes selected from $k$ different partitions (the dashed lines in Fig. 1 are a specific case of this scenario,
i.e., $m = 0$). With strategy (ii), the number of possible ways to choose $k$ nodes is

$$\sum_{m=0}^{\lfloor k/2 \rfloor} \binom{k}{2m} = 2^{k-1}. \tag{3}$$

Thus, in total there exist $2^{k-2}(k^2 - k + 2)$ ways to recover the original file using $k$ nodes. Considering the two
possible strategies, the proposed $(2k, k)$ code can tolerate any three node failures. Moreover, this code can tolerate up to
$k - 1$ node failures if the failed nodes belong to $k - 1$ different partitions.
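
The counting argument above can be checked by brute force. The following is a minimal sketch, not part of the original construction: it represents each packet as a bitmask over GF(2) and treats a node set as decodable when its packets have rank $k$. The function names (`generator_rows`, `gf2_rank`, `decodable`, and so on) are illustrative assumptions.

```python
from itertools import combinations


def generator_rows(k):
    """Rows of the (2k, k) code as bitmasks over GF(2).

    Bit j of a row is set when fragment d_(j+1) is XORed into that packet.
    Indices 0..k-1 are the systematic nodes, k..2k-1 the parity nodes.
    """
    full = (1 << k) - 1
    systematic = [1 << i for i in range(k)]
    parity = [full ^ (1 << i) for i in range(k)]  # all fragments except d_(i+1)
    return systematic + parity


def gf2_rank(rows):
    """Rank over GF(2) of a collection of bitmask rows (XOR-basis elimination)."""
    basis = {}  # leading bit position -> basis row with that leading bit
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]
    return len(basis)


def decodable(k, nodes):
    """True if the packets held by `nodes` determine all k fragments."""
    rows = generator_rows(k)
    return gf2_rank(rows[i] for i in nodes) == k


def count_decodable_k_subsets(k):
    """Number of k-node subsets from which a data collector can recover the file."""
    return sum(1 for subset in combinations(range(2 * k), k) if decodable(k, subset))


def tolerates_all_failures(k, failures):
    """True if every pattern of `failures` lost nodes leaves the file recoverable."""
    everyone = set(range(2 * k))
    return all(
        decodable(k, everyone - set(lost))
        for lost in combinations(range(2 * k), failures)
    )


k = 5
print(count_decodable_k_subsets(k), 2 ** (k - 2) * (k * k - k + 2))
print(tolerates_all_failures(k, 3))
```

For the $(10, 5)$ example both printed counts are 176 and the three-failure check prints `True`. The check is exhaustive, so it is only practical for small $k$.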

As discussed, the storage per node for a file of size $M$ is $M/k$, which is the same as for standard MDS codes and Minimum
Storage Regenerating (MSR) codes.^1 However, for a total storage of $2k \cdot \frac{M}{k} = 2M$, MSR and standard MDS codes
offer higher reliability. Recall that, to keep the reliability constant over time, each failed node should be repaired. In the
naive method, which can be applied to any MDS code, a single node is repaired after the whole data file is transferred over
the network (the repair bandwidth is equal to $M$). Regenerating codes can reduce the repair bandwidth if the new node is
allowed to connect to $d \ge k$ nodes. Our goal is to reduce the repair bandwidth compared to the naive method when the new
node connects to $d < k$ nodes. The following section addresses the repair model of the suggested code.

^1 The tradeoff curve identified in [15] has two extremal points: one end of the curve corresponds to the minimum storage per
node and the other end corresponds to the minimum bandwidth point. These two extremal points can be achieved by the use of
Minimum Storage Regenerating (MSR) and Minimum Bandwidth Regenerating (MBR) codes, respectively.

III. Repair Problem

When a node fails or leaves the system, a new node is initiated, which attempts to connect to existing nodes to reconstruct
the failed node. In repairing a failed node, we face two scenarios: (i) the parity or systematic node that shares a partition
with the failed node (its related node) is active, and (ii) the related node has also failed. When the related node is
available, the failed node can be reconstructed by communicating with only three nodes, i.e., the related node and both the
parity node and the systematic node of another active partition. In fact, there are $k - 1$ ways to repair a failed node by
downloading from only three nodes. For example, referring to Fig. 2, assume that the systematic node $S_1$, which holds data
fragment $d_1$, has failed. When parity node $P_1$, which stores parity packet $d_2 + d_3 + d_4 + d_5$, is active, the new
node can restore $d_1$ by downloading three packets in $k - 1 = 4$ ways as

$$
\begin{aligned}
&(d_2 + d_3 + d_4 + d_5) + (d_2) + (d_1 + d_3 + d_4 + d_5) \\
&(d_2 + d_3 + d_4 + d_5) + (d_3) + (d_1 + d_2 + d_4 + d_5) \\
&(d_2 + d_3 + d_4 + d_5) + (d_4) + (d_1 + d_2 + d_3 + d_5) \\
&(d_2 + d_3 + d_4 + d_5) + (d_5) + (d_1 + d_2 + d_3 + d_4)
\end{aligned}
$$

Thus, in scenario (i), three nodes are involved in the download needed to reconstruct a node, which results in a repair
bandwidth of $3M/k$.
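
The three-node repair of scenario (i) can be sketched on actual byte packets. This is a minimal illustration under stated assumptions: the helper names (`xor_packets`, `encode`, `repair_from_three`), the random test data and the 16-byte packet size are hypothetical and are not taken from the paper.

```python
import os


def xor_packets(*packets):
    """Bit-by-bit XOR of equal-length byte strings."""
    result = bytearray(len(packets[0]))
    for packet in packets:
        for position, byte in enumerate(packet):
            result[position] ^= byte
    return bytes(result)


def encode(fragments):
    """Parity packet i is the XOR of every fragment except fragment i (0-based)."""
    k = len(fragments)
    return [
        xor_packets(*(fragments[j] for j in range(k) if j != i))
        for i in range(k)
    ]


def repair_from_three(related_packet, helper_systematic, helper_parity):
    """Rebuild a lost packet from its related packet and both packets of one
    other intact partition. The same XOR works for a lost systematic packet
    (related = its parity) and for a lost parity packet (related = its data)."""
    return xor_packets(related_packet, helper_systematic, helper_parity)


k, packet_size = 5, 16                      # illustrative sizes
fragments = [os.urandom(packet_size) for _ in range(k)]
parity = encode(fragments)

# Node S_1 (index 0) is lost and P_1 is alive; any of the k - 1 other partitions can help.
rebuilt_d1 = [repair_from_three(parity[0], fragments[h], parity[h]) for h in range(1, k)]
# Node P_1 is lost instead and S_1 is alive.
rebuilt_p1 = [repair_from_three(fragments[0], fragments[h], parity[h]) for h in range(1, k)]

print(all(p == fragments[0] for p in rebuilt_d1))
print(all(p == parity[0] for p in rebuilt_p1))
```

Each repair downloads three packets of size $M/k$, which matches the $3M/k$ repair bandwidth stated above. The two XORed packets of the helper partition sum to the lost fragment's counterpart because the helper's own fragment cancels out.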

As discussed earlier, in MSR codes the new node should connect to $d \ge k$ nodes to ensure that a failed node can be
reconstructed. In these codes the repair bandwidth is $\frac{Md}{k(d-k+1)}$, which is a decreasing function of $d$ [15];
hence, when the new node connects to the minimum possible number of nodes, i.e., $k$ nodes, the repair bandwidth takes its
maximum value, i.e., $M$. For instance, in an $(n, k) = (10, 5)$ MSR code, a repair bandwidth of $\frac{3M}{5}$ is achieved if
the new node connects to $d = 6$ nodes, which is greater than the $d = 3$ nodes in the proposed scheme.
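
The comparison can be tabulated with exact fractions. The sketch below evaluates the MSR repair bandwidth formula quoted above for every admissible $d$ in the $(10, 5)$ case; the function names are illustrative, and bandwidths are expressed as fractions of the file size $M$.

```python
from fractions import Fraction


def msr_repair_bandwidth(k, d):
    """MSR repair bandwidth as a fraction of M: d / (k (d - k + 1))."""
    if d < k:
        raise ValueError("MSR repair needs d >= k helper nodes")
    return Fraction(d, k * (d - k + 1))


def proposed_repair_bandwidth(k):
    """Proposed code, scenario (i): three packets of size M / k."""
    return Fraction(3, k)


n, k = 10, 5
for d in range(k, n):                       # d = k, ..., n - 1
    print(d, msr_repair_bandwidth(k, d))
print("proposed, 3 nodes:", proposed_repair_bandwidth(k))
```

For $d = 5$ the MSR value is $M$, and for $d = 6$ it equals $3M/5$, the same bandwidth that the proposed code reaches with three helper nodes.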

For scenario (ii), we can consider two different strategies. In the first strategy, dubbed strategy A, the parity node is
repaired first and is then used to reconstruct the related systematic node. To recreate the parity node without using the
related systematic node, the new node should connect to $2m$ parity nodes and $k - 1 - 2m$ systematic nodes over $k - 1$
different partitions, and there are



$$\sum_{m=0}^{\lfloor (k-1)/2 \rfloor} \binom{k-1}{2m} = 2^{k-2} \tag{4}$$



ways to choose these nodes. For instance, referring to Fig. 2, in a $(10, 5)$ code there are $2^{5-2} = 8$ ways in which the
new node can use 0, 2 or 4 parity nodes to repair parity node $P_1$, which stores $d_2 + d_3 + d_4 + d_5$, without the use of
node $S_1$. These eight ways are

$$
\begin{aligned}
&(d_2) + (d_3) + (d_4) + (d_5) \\
&(d_1 + d_3 + d_4 + d_5) + (d_1 + d_2 + d_4 + d_5) + (d_4) + (d_5) \\
&(d_1 + d_3 + d_4 + d_5) + (d_1 + d_2 + d_3 + d_5) + (d_3) + (d_5) \\
&(d_1 + d_3 + d_4 + d_5) + (d_1 + d_2 + d_3 + d_4) + (d_3) + (d_4) \\
&(d_1 + d_2 + d_4 + d_5) + (d_1 + d_2 + d_3 + d_5) + (d_2) + (d_5) \\
&(d_1 + d_2 + d_4 + d_5) + (d_1 + d_2 + d_3 + d_4) + (d_2) + (d_4) \\
&(d_1 + d_2 + d_3 + d_5) + (d_1 + d_2 + d_3 + d_4) + (d_2) + (d_3) \\
&(d_1 + d_3 + d_4 + d_5) + (d_1 + d_2 + d_4 + d_5) + (d_1 + d_2 + d_3 + d_5) + (d_1 + d_2 + d_3 + d_4)
\end{aligned}
$$
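
The eight combinations listed above can be generated mechanically. The following sketch enumerates every even-sized choice of parity helpers for strategy A and checks symbolically that the XOR of the downloads equals the lost parity packet. Packets are modelled as bitmasks (bit $j$ set means $d_{j+1}$ is present); the function name `strategy_a_options` is an assumption made for illustration.

```python
from itertools import combinations


def strategy_a_options(k, failed):
    """Enumerate helper sets that rebuild parity packet p_failed without S_failed.

    Each option takes the parity node from an even-sized set of the other
    k - 1 partitions and the systematic node from the rest.
    Returns (parity_partitions, xor_of_downloads) pairs, 0-based.
    """
    full = (1 << k) - 1
    others = [j for j in range(k) if j != failed]
    options = []
    for size in range(0, k, 2):                      # 0, 2, 4, ... parity helpers
        for parity_partitions in combinations(others, size):
            combined = 0
            for j in others:
                packet = full ^ (1 << j) if j in parity_partitions else 1 << j
                combined ^= packet
            options.append((parity_partitions, combined))
    return options


k, failed = 5, 0                                     # repair P_1 in the (10, 5) code
target = ((1 << k) - 1) ^ (1 << failed)              # p_1 = d_2 + d_3 + d_4 + d_5
options = strategy_a_options(k, failed)
for parity_partitions, combined in options:
    helpers = [("P" if j in parity_partitions else "S") + str(j + 1)
               for j in range(k) if j != failed]
    print(" + ".join(helpers), combined == target)
print(len(options), 2 ** (k - 2))
```

An even number of parity packets is needed because each parity packet contributes the XOR of all fragments once, and these contributions cancel only in pairs.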

In the second strategy, called strategy B, the failed systematic node is repaired first and is then involved in the
reconstruction of the related parity node. When strategy B is employed, the new node needs to communicate with $2m + 1$ parity
nodes and $k - 2m - 2$ systematic nodes over $k - 1$ different partitions, and the number of possible ways to select these
surviving nodes is

$$\sum_{m=0}^{\lfloor (k-2)/2 \rfloor} \binom{k-1}{2m+1} = 2^{k-2}. \tag{5}$$



For example, as can be seen in Fig. 2, there exist $2^{5-2} = 8$ options for the new node to choose one or three parity nodes
for the repair of systematic node $S_1$, which holds $d_1$, when $P_1$ cannot be involved. These options are

$$
\begin{aligned}
&(d_1 + d_3 + d_4 + d_5) + (d_3) + (d_4) + (d_5) \\
&(d_1 + d_2 + d_4 + d_5) + (d_2) + (d_4) + (d_5) \\
&(d_1 + d_2 + d_3 + d_5) + (d_2) + (d_3) + (d_5) \\
&(d_1 + d_2 + d_3 + d_4) + (d_2) + (d_3) + (d_4) \\
&(d_1 + d_3 + d_4 + d_5) + (d_1 + d_2 + d_4 + d_5) + (d_1 + d_2 + d_3 + d_5) + (d_5) \\
&(d_1 + d_3 + d_4 + d_5) + (d_1 + d_2 + d_4 + d_5) + (d_1 + d_2 + d_3 + d_4) + (d_4) \\
&(d_1 + d_3 + d_4 + d_5) + (d_1 + d_2 + d_3 + d_5) + (d_1 + d_2 + d_3 + d_4) + (d_3) \\
&(d_1 + d_2 + d_4 + d_5) + (d_1 + d_2 + d_3 + d_5) + (d_1 + d_2 + d_3 + d_4) + (d_2)
\end{aligned}
$$
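
Strategy B can also be run end to end on byte packets, counting how many packets are downloaded when both nodes of a partition are lost. This is a hedged sketch: `repair_partition_strategy_b`, the random fragments and the 16-byte packet size are illustrative assumptions, and the packet count follows the paper's accounting in which the second repair step uses three nodes, one of them the newly rebuilt node.

```python
import os
from functools import reduce
from itertools import combinations


def xor_packets(packets):
    """Bit-by-bit XOR of a non-empty list of equal-length byte strings."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)


def encode(fragments):
    """Parity packet i is the XOR of every fragment except fragment i (0-based)."""
    k = len(fragments)
    return [xor_packets([fragments[j] for j in range(k) if j != i]) for i in range(k)]


def repair_partition_strategy_b(fragments, parity, failed, parity_helpers):
    """Both nodes of partition `failed` are lost.

    Step 1: rebuild d_failed from one node in each other partition, taking the
            parity node for the partitions in `parity_helpers` (odd count).
    Step 2: rebuild p_failed from the new d_failed and both nodes of one
            other partition.
    Returns (d_new, p_new, packets_downloaded).
    """
    k = len(fragments)
    others = [j for j in range(k) if j != failed]
    if len(parity_helpers) % 2 != 1 or not set(parity_helpers) <= set(others):
        raise ValueError("need an odd number of parity helpers from other partitions")
    downloads = [parity[j] if j in parity_helpers else fragments[j] for j in others]
    d_new = xor_packets(downloads)                                   # k - 1 packets
    helper = others[0]                                               # any other partition
    p_new = xor_packets([d_new, fragments[helper], parity[helper]])  # 3 packets
    return d_new, p_new, len(downloads) + 3


k = 5
fragments = [os.urandom(16) for _ in range(k)]
parity = encode(fragments)
results = [
    repair_partition_strategy_b(fragments, parity, 0, helpers)
    for size in range(1, k, 2)                                       # 1 or 3 parity helpers
    for helpers in combinations(range(1, k), size)
]
print(len(results))
print(all(d == fragments[0] and p == parity[0] for d, p, _ in results))
print({count for _, _, count in results})
```

In the $(10, 5)$ example all eight helper choices succeed, and each uses $(k - 1) + 3 = 7$ packets in total.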

In both strategies, $k - 1$ nodes are involved in the download needed to create the first new node. As discussed, this node,
together with two other nodes, is then used to reconstruct its related node. Therefore, a repair bandwidth of
$\frac{(k-1)M}{k} + \frac{3M}{k} = \frac{(k+2)M}{k}$ is consumed to repair two failed nodes from one partition. In fact, the
average repair bandwidth for reconstructing each node is $\frac{(k+2)M}{2k}$.

As an example, in a $(10, 5)$ code, the two failed packets $d_1$ and $d_2 + d_3 + d_4 + d_5$ are reconstructed after
downloading from a total of 7 nodes (i.e., 3.5 nodes per packet), consuming a repair bandwidth of $\frac{7M}{5}$ (i.e.,
$\frac{7M}{10}$ per packet). Recall that in a $(10, 5)$ MSR code a new node has to contact at least five nodes, and contacting
exactly five leads to a repair bandwidth of $M$. It is worth mentioning that, for the two specific cases $k = 2$ and $k = 3$,
reconstructing a lost packet by communicating with $k - 1$ nodes (strategies A and B) is more efficient than using three
nodes, because in these cases $\frac{(k-1)M}{k}$ is smaller than $\frac{3M}{k}$.
Note that the total storage for a file of size $M$ is $2M$ regardless of $k$, and the repair bandwidth can be reduced by
increasing $k$. Therefore, for a given total storage, the suggested scheme establishes a tradeoff between the repair bandwidth
and the number of storage nodes. Moreover, the number of nodes involved in the repair of a single node failure is three,
regardless of $k$.
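
The tradeoff between repair bandwidth and the number of storage nodes can be tabulated directly from the expressions derived in this section. The sketch below is illustrative only: the function name `repair_costs` and the chosen values of $k$ are assumptions, and all quantities are expressed in units of the file size $M$.

```python
from fractions import Fraction


def repair_costs(k):
    """Costs of the (2k, k) code in units of the file size M."""
    per_node = Fraction(1, k)                       # each node stores M / k
    return {
        "nodes": 2 * k,
        "total_storage": 2 * k * per_node,          # always 2 M
        "single_repair": 3 * per_node,              # related node alive: 3 packets
        "pair_repair_avg": (k + 2) * per_node / 2,  # both nodes of a partition lost
    }


for k in (4, 5, 8, 16, 32):
    row = repair_costs(k)
    print(k, row["nodes"], row["total_storage"], row["single_repair"], row["pair_repair_avg"])
```

The total storage stays at $2M$ in every row, while the single-failure repair bandwidth $3M/k$ shrinks as more nodes are used.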



IV. Conclusion 

This paper introduces a non-MDS scheme that is applicable to storage systems. The proposed code comprises $k$ partitions, each
consisting of a related systematic node and parity node, and can tolerate any three node failures. It can also tolerate any
$k - 1$ node failures if at most two of them are from a common partition. Moreover, each single node failure can be repaired
through access to three nodes. The suggested code is simple to implement: each node stores only one packet, and both the
recovery of the original data file and the reconstruction of a lost packet are achieved by XORing the stored packets.